Abstract. We present a brief survey of existing mistake bounds and 
introduce novel bounds for the Perceptron or the kernel Perceptron al- 
gorithm. Our novel bounds generalize beyond standard margin-loss type 
bounds, allow for any convex and Lipschitz loss function, and admit a 
very simple proof. 

1 Introduction 

The Perceptron algorithm belongs to the broad family of on-line learning 
algorithms (see Cesa-Bianchi and Lugosi [2006] for a survey) and admits 
a large number of variants. The algorithm learns a linear separator by 
processing the training sample in an on-line fashion, examining a single 
example at each iteration [Rosenblatt, 1958]. At each round, the cur- 
rent hypothesis is updated if it makes a mistake, that is if it incorrectly 
classifies the new training point processed. The full pseudocode of the 
algorithm is provided in Figure 1. In what follows, we will assume that
$\mathbf{w}_0 = 0$ and $\eta = 1$ for simplicity of presentation; however,
the more general case also admits similar guarantees, which can be
derived with the same methods presented here.

This paper briefly surveys some existing mistake bounds for the Percep- 
tron algorithm and introduces new ones which can be used to derive 
generalization bounds in a stochastic setting. A mistake bound is an up- 
per bound on the number of updates, or the number of mistakes, made 
by the Perceptron algorithm when processing a sequence of training ex- 
amples. Here, the bound will be expressed in terms of the performance 
of any linear separator, including the best. Such mistake bounds can be 
directly used to derive generalization guarantees for a combined hypoth- 
esis, using existing on-line-to-batch techniques. 



2 Separable case 

The seminal work of Novikoff [1962] gave the first margin-based bound 
for the Perceptron algorithm, one of the early results in learning theory 
and probably one of the first based on the notion of margin. Assuming
that the data is separable with some margin $\rho$, Novikoff showed that
the number of mistakes made by the Perceptron algorithm can be bounded
as a function of the normalized margin $\rho/r$, where $r$ is the radius
of the sphere containing the training instances. We start with a lemma
that can be used to prove Novikoff's theorem and that will be used
throughout.



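Perceptron($\mathbf{w}_0$)

    $\mathbf{w}_1 \leftarrow \mathbf{w}_0$    ▷ typically $\mathbf{w}_0 = 0$
    for $t \leftarrow 1$ to $T$ do
        Receive($\mathbf{x}_t$)
        $\widehat{y}_t \leftarrow \mathrm{sgn}(\mathbf{w}_t \cdot \mathbf{x}_t)$
        Receive($y_t$)
        if ($\widehat{y}_t \neq y_t$) then
            $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_t \mathbf{x}_t$    ▷ more generally $\eta y_t \mathbf{x}_t$, $\eta > 0$
        else $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t$
    return $\mathbf{w}_{T+1}$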



Fig. 1. Perceptron algorithm [Rosenblatt, 1958]. 
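
The procedure of Figure 1 can be sketched in a few lines of Python. This
is only a minimal illustration under stated assumptions: $\mathbf{w}_0 = 0$,
$\eta = 1$, and the convention sgn(0) = +1, which the figure does not fix.
The function name `perceptron` and the list-based vectors are illustrative
choices, not part of the original presentation.

```python
def perceptron(xs, ys):
    """One pass of the Perceptron of Figure 1 with w_0 = 0 and eta = 1.

    xs: list of feature vectors (lists of floats); ys: labels in {-1, +1}.
    Returns the final weight vector and the list I of update rounds (0-based).
    """
    w = [0.0] * len(xs[0])
    updates = []
    for t, (x, y) in enumerate(zip(xs, ys)):
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_hat = 1 if score >= 0 else -1      # assumption: sgn(0) = +1
        if y_hat != y:                       # mistake: apply the update
            w = [wi + y * xi for wi, xi in zip(w, x)]
            updates.append(t)
    return w, updates
```

The list of update rounds is returned alongside the weights because the
mistake bounds discussed in this paper are all stated in terms of that set.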



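Lemma 1. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$. Then, the
following inequality holds:

\[
\Big\| \sum_{t \in I} y_t \mathbf{x}_t \Big\|
  \le \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}.
\]

Proof. The inequality holds using the following sequence of observations,

\begin{align*}
\Big\| \sum_{t \in I} y_t \mathbf{x}_t \Big\|
  &= \Big\| \sum_{t \in I} (\mathbf{w}_{t+1} - \mathbf{w}_t) \Big\|
     && \text{(definition of updates)} \\
  &= \|\mathbf{w}_{T+1}\|
     && \text{(telescoping sum, $\mathbf{w}_0 = 0$)} \\
  &= \sqrt{\sum_{t \in I} \|\mathbf{w}_{t+1}\|^2 - \|\mathbf{w}_t\|^2}
     && \text{(telescoping sum, $\mathbf{w}_0 = 0$)} \\
  &= \sqrt{\sum_{t \in I} \|\mathbf{w}_t + y_t \mathbf{x}_t\|^2 - \|\mathbf{w}_t\|^2}
     && \text{(definition of updates)} \\
  &= \sqrt{\sum_{t \in I} 2 y_t \mathbf{w}_t \cdot \mathbf{x}_t + \|\mathbf{x}_t\|^2} \\
  &\le \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}.
\end{align*}

The final inequality uses the fact that an update is made at round $t$
only when the current hypothesis makes a mistake, that is,
$y_t(\mathbf{w}_t \cdot \mathbf{x}_t) \le 0$. □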

The lemma can be used straightforwardly to derive the following mistake 
bound for the separable setting. 

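Theorem 1 ([Novikoff, 1962]). Let $\mathbf{x}_1, \ldots, \mathbf{x}_T \in
\mathbb{R}^N$ be a sequence of $T$ points with $\|\mathbf{x}_t\| \le r$ for
all $t \in [1, T]$, for some $r > 0$. Assume that there exist $\rho > 0$
and $\mathbf{v} \in \mathbb{R}^N$, $\mathbf{v} \neq 0$, such that for all
$t \in [1, T]$, $\rho \le \frac{y_t (\mathbf{v} \cdot \mathbf{x}_t)}{\|\mathbf{v}\|}$.
Then, the number of updates made by the Perceptron algorithm when
processing $\mathbf{x}_1, \ldots, \mathbf{x}_T$ is bounded by $r^2/\rho^2$.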

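Proof. Let $I$ denote the subset of the $T$ rounds at which there is an
update, and let $M$ be the total number of updates, i.e., $|I| = M$.
Summing up the inequalities
$\rho \le y_t (\mathbf{v} \cdot \mathbf{x}_t)/\|\mathbf{v}\|$ over $t \in I$
yields:

\[
M \rho
  \le \frac{\mathbf{v} \cdot \sum_{t \in I} y_t \mathbf{x}_t}{\|\mathbf{v}\|}
  \le \Big\| \sum_{t \in I} y_t \mathbf{x}_t \Big\|
  \le \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}
  \le \sqrt{M r^2},
\]

where the second inequality holds by the Cauchy-Schwarz inequality, the
third by Lemma 1 and the final one by assumption. Comparing the left-
and right-hand sides gives $\sqrt{M} \le r/\rho$, that is,
$M \le r^2/\rho^2$. □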
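
As a sanity check, Lemma 1 and the bound of Theorem 1 can be verified
numerically. The sketch below is only an illustration: the synthetic
sample, the separator $\mathbf{v} = (1, 1)/\sqrt{2}$, the margin value 0.2
and the helper names are all assumptions made for the example, and ties
$\mathbf{w}_t \cdot \mathbf{x}_t = 0$ are treated as mistakes.

```python
import math
import random

def perceptron_updates(xs, ys):
    """Single pass of the Perceptron (w_0 = 0, eta = 1); returns the update rounds I."""
    w = [0.0] * len(xs[0])
    rounds = []
    for t, (x, y) in enumerate(zip(xs, ys)):
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake (ties count)
            w = [wi + y * xi for wi, xi in zip(w, x)]
            rounds.append(t)
    return rounds

def separable_sample(n, rho, seed=0):
    """Hypothetical data: points in [-1, 1]^2 labeled by v = (1, 1)/sqrt(2),
    keeping only points at distance at least rho from the separator."""
    rng = random.Random(seed)
    v = (1 / math.sqrt(2), 1 / math.sqrt(2))
    xs, ys = [], []
    while len(xs) < n:
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        m = v[0] * x[0] + v[1] * x[1]
        if abs(m) >= rho:
            xs.append(x)
            ys.append(1 if m > 0 else -1)
    return xs, ys

rho = 0.2
xs, ys = separable_sample(200, rho)
I = perceptron_updates(xs, ys)
r = max(math.sqrt(x[0] ** 2 + x[1] ** 2) for x in xs)

# Lemma 1: || sum_{t in I} y_t x_t || <= sqrt( sum_{t in I} ||x_t||^2 )
s = [sum(ys[t] * xs[t][k] for t in I) for k in range(2)]
lhs = math.sqrt(s[0] ** 2 + s[1] ** 2)
rhs = math.sqrt(sum(xs[t][0] ** 2 + xs[t][1] ** 2 for t in I))

# Theorem 1: the number of updates is at most r^2 / rho^2
novikoff = r ** 2 / rho ** 2
print(len(I), lhs <= rhs, novikoff)
```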

3 Non-separable case 

In real-world problems, the training sample processed by the Perceptron 
algorithm is typically not linearly separable. Nevertheless, it is possible 
to give a margin-based mistake bound in that general case in terms of 
the radius of the sphere containing the sample and the margin-based 
loss of an arbitrary weight vector. We present two different types of
bounds: first, a bound that depends on the $L_1$-norm of the vector of
$\rho$-margin hinge losses, or the vector of more general losses that we
will describe; next, a bound that depends on the $L_2$-norm of the vector
of margin losses, which extends the original results presented by Freund
and Schapire [1999].

3.1 $L_1$-norm mistake bounds

We first present a simple proof of a mistake bound for the Perceptron
algorithm that depends on the $L_1$-norm of the losses incurred by an
arbitrary weight vector, for a general definition of the loss function
that covers the $\rho$-margin hinge loss. The family of admissible loss
functions is quite general and defined as follows.

Definition 1 ($\gamma$-admissible loss function). A $\gamma$-admissible
loss function $\phi_\gamma \colon \mathbb{R} \to \mathbb{R}_+$ satisfies the
following conditions:

1. The function $\phi_\gamma$ is convex.

2. $\phi_\gamma$ is non-negative: $\forall x \in \mathbb{R}$, $\phi_\gamma(x) \ge 0$.

3. At zero, $\phi_\gamma$ is strictly positive: $\phi_\gamma(0) > 0$.

4. $\phi_\gamma$ is $\gamma$-Lipschitz: $|\phi_\gamma(x) - \phi_\gamma(y)| \le \gamma |x - y|$, for some $\gamma > 0$.

These are mild conditions satisfied by many loss functions including the
hinge loss, the squared hinge loss, the Huber loss and general $p$-norm
losses over bounded domains.
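
To make the definition concrete, the following Python sketch writes down
two of the losses mentioned above and estimates the Lipschitz constant of
the hinge loss on a grid. The margin parameter `rho`, the grid, and the
helper `empirical_lipschitz` are illustrative assumptions; the numerical
slope is only a spot check, not a proof of admissibility.

```python
def hinge(x, rho):
    """rho-margin hinge loss (1 - x/rho)_+ ; equals 1 at x = 0."""
    return max(0.0, 1.0 - x / rho)

def squared_hinge(x, rho):
    """rho-margin squared hinge loss (1 - x/rho)_+^2 ; equals 1 at x = 0.
    It is Lipschitz only when x ranges over a bounded domain."""
    return max(0.0, 1.0 - x / rho) ** 2

def empirical_lipschitz(phi, grid):
    """Largest slope |phi(a) - phi(b)| / |a - b| over consecutive grid points."""
    return max(abs(phi(a) - phi(b)) / abs(a - b) for a, b in zip(grid, grid[1:]))

rho = 0.5
grid = [-2.0 + 0.01 * k for k in range(401)]          # grid on [-2, 2]
slope = empirical_lipschitz(lambda x: hinge(x, rho), grid)
print(slope, 1.0 / rho)   # the observed slope should not exceed 1/rho
```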



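Theorem 2. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$. For any
vector $\mathbf{u} \in \mathbb{R}^N$ with $\|\mathbf{u}\| \le 1$ and any
$\gamma$-admissible loss function $\phi_\gamma$, consider the vector of
losses incurred by $\mathbf{u}$:
$\mathbf{L}_{\phi_\gamma}(\mathbf{u}) = [\phi_\gamma(y_t(\mathbf{u} \cdot \mathbf{x}_t))]_{t \in I}$.
Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm
can be bounded as follows:

\[
M_T \le \inf_{\gamma > 0, \|\mathbf{u}\| \le 1}
  \frac{1}{\phi_\gamma(0)} \|\mathbf{L}_{\phi_\gamma}(\mathbf{u})\|_1
  + \frac{\gamma}{\phi_\gamma(0)} \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}.
\tag{1}
\]

If we further assume that $\|\mathbf{x}_t\| \le r$ for all $t \in [1, T]$,
for some $r > 0$, this implies

\[
M_T \le \inf_{\gamma > 0, \|\mathbf{u}\| \le 1}
  \left( \frac{\gamma r}{\phi_\gamma(0)}
  + \sqrt{\frac{\|\mathbf{L}_{\phi_\gamma}(\mathbf{u})\|_1}{\phi_\gamma(0)}} \right)^2.
\tag{2}
\]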

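Proof. For all $\gamma > 0$ and $\mathbf{u}$ with $\|\mathbf{u}\| \le 1$,
the following statements hold. By convexity of $\phi_\gamma$ we have
$\frac{1}{M_T} \sum_{t \in I} \phi_\gamma(y_t \mathbf{u} \cdot \mathbf{x}_t) \ge \phi_\gamma(\mathbf{u} \cdot \mathbf{z})$,
where $\mathbf{z} = \frac{1}{M_T} \sum_{t \in I} y_t \mathbf{x}_t$. Then, by
using the Lipschitz property of $\phi_\gamma$ we have,

\begin{align*}
\phi_\gamma(\mathbf{u} \cdot \mathbf{z})
  &= \phi_\gamma(\mathbf{u} \cdot \mathbf{z}) - \phi_\gamma(0) + \phi_\gamma(0) \\
  &\ge -|\phi_\gamma(0) - \phi_\gamma(\mathbf{u} \cdot \mathbf{z})| + \phi_\gamma(0) \\
  &\ge -\gamma |\mathbf{u} \cdot \mathbf{z}| + \phi_\gamma(0).
\end{align*}

Combining the two inequalities above and multiplying both sides by $M_T$
implies

\[
M_T \phi_\gamma(0)
  \le \sum_{t \in I} \phi_\gamma(y_t \mathbf{u} \cdot \mathbf{x}_t)
  + \gamma \Big| \sum_{t \in I} y_t \mathbf{u} \cdot \mathbf{x}_t \Big|.
\]

Finally, using the Cauchy-Schwarz inequality and Lemma 1 yields

\[
\Big| \sum_{t \in I} y_t \mathbf{u} \cdot \mathbf{x}_t \Big|
  = \Big| \mathbf{u} \cdot \Big( \sum_{t \in I} y_t \mathbf{x}_t \Big) \Big|
  \le \|\mathbf{u}\| \Big\| \sum_{t \in I} y_t \mathbf{x}_t \Big\|
  \le \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2},
\]

which completes the proof of the first statement after rearranging terms.
If it is further assumed that $\|\mathbf{x}_t\| \le r$ for all $t \in I$,
then this implies
$M_T \phi_\gamma(0) - \gamma r \sqrt{M_T} - \sum_{t \in I} \phi_\gamma(y_t \mathbf{u} \cdot \mathbf{x}_t) \le 0$.
Solving this quadratic expression in terms of $\sqrt{M_T}$ proves the
second statement. □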
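
The last step of the proof can be spelled out as follows. The shorthand
$s$, $a$, $b$, $c$ is introduced here only for illustration and does not
appear in the original argument: $s = \sqrt{M_T}$, $a = \phi_\gamma(0)$,
$b = \gamma r$ and $c = \|\mathbf{L}_{\phi_\gamma}(\mathbf{u})\|_1$.

```latex
% The quadratic inequality a s^2 - b s - c <= 0 with a > 0, b >= 0, c >= 0
% forces s to lie below the larger root:
\[
s \;\le\; \frac{b + \sqrt{b^2 + 4ac}}{2a}
  \;\le\; \frac{b + b + 2\sqrt{ac}}{2a}
  \;=\; \frac{b}{a} + \sqrt{\frac{c}{a}},
\]
% where the second inequality uses \sqrt{x + y} \le \sqrt{x} + \sqrt{y}.
% Squaring both sides gives
\[
M_T \;\le\; \left( \frac{\gamma r}{\phi_\gamma(0)}
  + \sqrt{\frac{\|\mathbf{L}_{\phi_\gamma}(\mathbf{u})\|_1}{\phi_\gamma(0)}} \right)^2 .
\]
```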

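It is straightforward to see that the $\rho$-margin hinge loss
$\phi_\rho(x) = (1 - x/\rho)_+$ is $(1/\rho)$-admissible with
$\phi_\rho(0) = 1$ for all $\rho$, which gives the following corollary.

Corollary 1. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$. For any
$\rho > 0$ and any $\mathbf{u} \in \mathbb{R}^N$ with $\|\mathbf{u}\| \le 1$,
consider the vector of $\rho$-hinge losses incurred by $\mathbf{u}$:
$\mathbf{L}_\rho(\mathbf{u}) = \big[ \big(1 - \frac{y_t(\mathbf{u} \cdot \mathbf{x}_t)}{\rho}\big)_+ \big]_{t \in I}$.
Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm
can be bounded as follows:

\[
M_T \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \|\mathbf{L}_\rho(\mathbf{u})\|_1
  + \frac{\sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}}{\rho}.
\tag{3}
\]

If we further assume that $\|\mathbf{x}_t\| \le r$ for all $t \in [1, T]$,
for some $r > 0$, this implies

\[
M_T \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \left( \frac{r}{\rho} + \sqrt{\|\mathbf{L}_\rho(\mathbf{u})\|_1} \right)^2.
\tag{4}
\]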



The mistake bound (3) appears already in Cesa-Bianchi et al. [2004] but 
we could not find its proof either in that paper or in those it references 
for this bound. 
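
For a fixed pair $(\mathbf{u}, \rho)$, the right-hand sides of (3) and (4)
are easy to evaluate, and each of them upper-bounds the number of updates
since the infimum does. The sketch below illustrates this on a small
non-separable sample. The data, the choice $\mathbf{u} = (1, 0)$,
$\rho = 0.5$ and the function names are assumptions made for the example;
ties $\mathbf{w}_t \cdot \mathbf{x}_t = 0$ are treated as mistakes.

```python
import math

def perceptron_updates(xs, ys):
    """Single pass of the Perceptron (w_0 = 0, eta = 1); returns the update rounds I."""
    w = [0.0] * len(xs[0])
    rounds = []
    for t, (x, y) in enumerate(zip(xs, ys)):
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake (ties count)
            w = [wi + y * xi for wi, xi in zip(w, x)]
            rounds.append(t)
    return rounds

def hinge_l1_bounds(xs, ys, rounds, u, rho):
    """Right-hand sides of (3) and (4) for one fixed pair (u, rho), ||u|| <= 1."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    l1 = sum(max(0.0, 1.0 - ys[t] * dot(u, xs[t]) / rho) for t in rounds)
    sum_sq = sum(dot(xs[t], xs[t]) for t in rounds)
    r = max(math.sqrt(dot(x, x)) for x in xs)
    return l1 + math.sqrt(sum_sq) / rho, (r / rho + math.sqrt(l1)) ** 2

# Hypothetical, non-separable toy sample: the last two labels are noisy.
xs = [[1.0, 0.5], [-1.0, 0.2], [0.5, -1.0], [-0.3, -0.8], [0.9, 0.1], [-0.7, 0.4]]
ys = [1, -1, 1, -1, -1, 1]
rounds = perceptron_updates(xs, ys)
bound3, bound4 = hinge_l1_bounds(xs, ys, rounds, u=[1.0, 0.0], rho=0.5)
print(len(rounds), bound3, bound4)
```

Tighter values are obtained by searching over $\mathbf{u}$ and $\rho$,
which this sketch does not attempt.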

Another application of Theorem 2 is to the squared hinge loss
$\phi_\rho(x) = (1 - x/\rho)_+^2$. Assume that $\|\mathbf{x}\| \le r$; then
the inequality $|y(\mathbf{u} \cdot \mathbf{x})| \le \|\mathbf{u}\| \|\mathbf{x}\| \le r$
implies that the derivative of the squared hinge loss is also bounded,
achieving a maximum absolute value
$|\phi'_\rho(r)| = \big| \frac{2}{\rho} \big( \frac{r}{\rho} - 1 \big) \big| \le \frac{2r}{\rho^2}$.
Thus, the $\rho$-margin squared hinge loss is $(2r/\rho^2)$-admissible with
$\phi_\rho(0) = 1$ for all $\rho$. This leads to the following corollary.

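Corollary 2. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$ with
$\|\mathbf{x}_t\| \le r$ for all $t \in [1, T]$. For any $\rho > 0$ and any
$\mathbf{u} \in \mathbb{R}^N$ with $\|\mathbf{u}\| \le 1$, consider the
vector of $\rho$-margin squared hinge losses incurred by $\mathbf{u}$:
$\mathbf{L}_\rho(\mathbf{u}) = \big[ \big(1 - \frac{y_t(\mathbf{u} \cdot \mathbf{x}_t)}{\rho}\big)_+^2 \big]_{t \in I}$.
Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm
can be bounded as follows:

\[
M_T \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \|\mathbf{L}_\rho(\mathbf{u})\|_1
  + \frac{2r \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}}{\rho^2}.
\tag{5}
\]

This also implies

\[
M_T \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \left( \frac{2r^2}{\rho^2} + \sqrt{\|\mathbf{L}_\rho(\mathbf{u})\|_1} \right)^2.
\tag{6}
\]
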

Theorem 2 can be similarly used to derive mistake bounds in terms of 
other admissible losses. 



3.2 $L_2$-norm mistake bounds



The original results of this section are due to Freund and Schapire [1999].
Here, we extend their proof to derive finer mistake bounds for the
Perceptron algorithm in terms of the $L_2$-norm of the vector of hinge
losses of an arbitrary weight vector at points where an update is made.

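Theorem 3. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$. For any
$\rho > 0$ and any $\mathbf{u} \in \mathbb{R}^N$ with $\|\mathbf{u}\| \le 1$,
consider the vector of $\rho$-hinge losses incurred by $\mathbf{u}$:
$\mathbf{L}_\rho(\mathbf{u}) = \big[ \big(1 - \frac{y_t(\mathbf{u} \cdot \mathbf{x}_t)}{\rho}\big)_+ \big]_{t \in I}$.
Then, the number of updates $M_T = |I|$ made by the Perceptron algorithm
can be bounded as follows:

\[
M_T \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \left( \frac{\|\mathbf{L}_\rho(\mathbf{u})\|_2}{2}
  + \sqrt{\frac{\|\mathbf{L}_\rho(\mathbf{u})\|_2^2}{4}
  + \frac{\sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}}{\rho}} \right)^2.
\tag{7}
\]

If we further assume that $\|\mathbf{x}_t\| \le r$ for all $t \in [1, T]$,
for some $r > 0$, this implies

\[
M_T \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \left( \frac{r}{\rho} + \|\mathbf{L}_\rho(\mathbf{u})\|_2 \right)^2.
\tag{8}
\]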



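Proof. We first reduce the problem to the separable case by mapping each
input vector $\mathbf{x}_t \in \mathbb{R}^N$ to a vector
$\mathbf{x}'_t \in \mathbb{R}^{N+T}$ as follows:

\[
\mathbf{x}_t =
\begin{bmatrix} x_{t,1} \\ \vdots \\ x_{t,N} \end{bmatrix}
\mapsto
\mathbf{x}'_t =
\begin{bmatrix}
x_{t,1} & \cdots & x_{t,N} & 0 & \cdots & 0 &
\underbrace{\Delta}_{(N+t)\text{th component}} & 0 & \cdots & 0
\end{bmatrix}^\top,
\]

where the first $N$ components of $\mathbf{x}'_t$ coincide with those of
$\mathbf{x}_t$ and the only other non-zero component is the $(N+t)$th
component, which is set to $\Delta$, a parameter whose value will be
determined later. Define $l_t$ by
$l_t = \big(1 - \frac{y_t \mathbf{u} \cdot \mathbf{x}_t}{\rho}\big)_+ 1_{t \in I}$.
Then, the vector $\mathbf{u}$ is replaced by the vector $\mathbf{u}'$
defined by

\[
\mathbf{u}' =
\begin{bmatrix}
\frac{u_1}{Z} & \cdots & \frac{u_N}{Z} &
\frac{y_1 l_1 \rho}{\Delta Z} & \cdots & \frac{y_T l_T \rho}{\Delta Z}
\end{bmatrix}^\top.
\]

The first $N$ components of $\mathbf{u}'$ are equal to the components of
$\mathbf{u}/Z$ and the remaining $T$ components are functions of the
labels and hinge losses. The normalization factor $Z$ is chosen to
guarantee that $\|\mathbf{u}'\| = 1$:
$Z = \sqrt{1 + \frac{\rho^2 \|\mathbf{L}_\rho(\mathbf{u})\|_2^2}{\Delta^2}}$.
Since the additional coordinates of the instances are non-zero exactly
once, the predictions made by the Perceptron algorithm for
$\mathbf{x}'_t$, $t \in [1, T]$, coincide with those made in the original
space for $\mathbf{x}_t$, $t \in [1, T]$. In particular, a change made to
the additional coordinates of $\mathbf{w}'$ does not affect any subsequent
prediction. Furthermore, by definition of $\mathbf{u}'$ and
$\mathbf{x}'_t$, we can write for any $t \in I$:

\begin{align*}
y_t \mathbf{u}' \cdot \mathbf{x}'_t
  &= y_t \left( \frac{\mathbf{u} \cdot \mathbf{x}_t}{Z}
     + \Delta \frac{y_t l_t \rho}{Z \Delta} \right) \\
  &= \frac{y_t \mathbf{u} \cdot \mathbf{x}_t}{Z} + \frac{l_t \rho}{Z} \\
  &\ge \frac{y_t \mathbf{u} \cdot \mathbf{x}_t}{Z}
     + \frac{\rho - y_t(\mathbf{u} \cdot \mathbf{x}_t)}{Z}
   = \frac{\rho}{Z},
\end{align*}

where the inequality results from the definition of $l_t$. Summing up the
inequalities for all $t \in I$ and using Lemma 1 yields
$M_T \frac{\rho}{Z} \le \sum_{t \in I} y_t (\mathbf{u}' \cdot \mathbf{x}'_t) \le \sqrt{\sum_{t \in I} \|\mathbf{x}'_t\|^2}$.
Substituting the value of $Z$ and rewriting in terms of $\mathbf{x}$
implies:

\begin{align*}
M_T^2
  &\le \left( \frac{1}{\rho^2}
     + \frac{\|\mathbf{L}_\rho(\mathbf{u})\|_2^2}{\Delta^2} \right)
     \left( R^2 + M_T \Delta^2 \right) \\
  &= \frac{R^2}{\rho^2}
     + \frac{R^2 \|\mathbf{L}_\rho(\mathbf{u})\|_2^2}{\Delta^2}
     + \frac{M_T \Delta^2}{\rho^2}
     + M_T \|\mathbf{L}_\rho(\mathbf{u})\|_2^2,
\end{align*}

where $R = \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}$. Now, solving for
$\Delta$ to minimize this bound gives
$\Delta^2 = \frac{\rho \|\mathbf{L}_\rho(\mathbf{u})\|_2 R}{\sqrt{M_T}}$
and further simplifies the bound:

\begin{align*}
M_T^2
  &\le \frac{R^2}{\rho^2}
     + 2 \frac{\sqrt{M_T} \|\mathbf{L}_\rho(\mathbf{u})\|_2 R}{\rho}
     + M_T \|\mathbf{L}_\rho(\mathbf{u})\|_2^2 \\
  &= \left( \frac{R}{\rho}
     + \sqrt{M_T} \|\mathbf{L}_\rho(\mathbf{u})\|_2 \right)^2.
\end{align*}

Solving the second-degree inequality
$M_T - \sqrt{M_T} \|\mathbf{L}_\rho(\mathbf{u})\|_2 - \frac{R}{\rho} \le 0$
proves the first statement of the theorem. The second statement is
obtained by first bounding $R$ with $r \sqrt{M_T}$ and then solving the
second-degree inequality. □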
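
The reduction used in this proof can be checked numerically. The sketch
below builds the lifted instances and the lifted comparator for a small
example and verifies the two properties the proof relies on: the lifted
comparator has unit norm when $\|\mathbf{u}\| = 1$, and it attains margin
at least $\rho/Z$ on every round of $I$. The data, the set `rounds`
standing in for $I$, the values of `rho` and `delta`, and the function
name `lift` are all assumptions made for illustration.

```python
import math

def lift(xs, ys, u, rho, delta, rounds):
    """Build the lifted instances x'_t, the lifted comparator u' and Z,
    following the construction in the proof of Theorem 3."""
    T = len(xs)
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    # l_t = (1 - y_t (u . x_t) / rho)_+ for t in I, and 0 otherwise
    l = [max(0.0, 1.0 - ys[t] * dot(u, xs[t]) / rho) if t in rounds else 0.0
         for t in range(T)]
    Z = math.sqrt(1.0 + rho ** 2 * sum(v * v for v in l) / delta ** 2)
    # x'_t: original coordinates, then delta in position N + t, zeros elsewhere
    xs_lift = [list(x) + [delta if s == t else 0.0 for s in range(T)]
               for t, x in enumerate(xs)]
    u_lift = [a / Z for a in u] + [ys[t] * l[t] * rho / (delta * Z) for t in range(T)]
    return xs_lift, u_lift, Z

xs = [[1.0, 0.5], [-1.0, 0.2], [0.5, -1.0], [0.9, 0.1]]
ys = [1, -1, 1, -1]
rounds = {0, 2, 3}            # hypothetical update set I (0-based rounds)
rho, delta = 0.5, 0.7
xs_lift, u_lift, Z = lift(xs, ys, [0.6, 0.8], rho, delta, rounds)

norm_u_lift = math.sqrt(sum(a * a for a in u_lift))
margins = [ys[t] * sum(p * q for p, q in zip(u_lift, xs_lift[t])) for t in sorted(rounds)]
print(norm_u_lift, min(margins), rho / Z)
```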



3.3 Discussion 

One natural question this survey raises is the respective quality of the
$L_1$- and $L_2$-norm bounds. The comparison of (4) and (8) for the
$\rho$-margin hinge loss shows that, for a fixed $\rho$, the bounds differ
only by the following two quantities:

\begin{align*}
\min_{\|\mathbf{u}\| \le 1} \|\mathbf{L}_\rho(\mathbf{u})\|_1
  &= \min_{\|\mathbf{u}\| \le 1} \sum_{t \in I}
     \big(1 - y_t(\mathbf{u} \cdot \mathbf{x}_t)/\rho\big)_+ \\
\min_{\|\mathbf{u}\| \le 1} \|\mathbf{L}_\rho(\mathbf{u})\|_2^2
  &= \min_{\|\mathbf{u}\| \le 1} \sum_{t \in I}
     \big(1 - y_t(\mathbf{u} \cdot \mathbf{x}_t)/\rho\big)_+^2.
\end{align*}

These two quantities are data-dependent and in general not comparable.
For a vector $\mathbf{u}$ for which the individual losses
$\big(1 - y_t(\mathbf{u} \cdot \mathbf{x}_t)/\rho\big)_+$ are all less than
one, we have
$\|\mathbf{L}_\rho(\mathbf{u})\|_2^2 \le \|\mathbf{L}_\rho(\mathbf{u})\|_1$,
while the contrary holds if the individual losses are larger than one.
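
A two-line numerical illustration of this comparison, with made-up loss
vectors (the values and the helper name `l1_and_l2sq` are assumptions
chosen only for the example):

```python
def l1_and_l2sq(losses):
    """Return (||L||_1, ||L||_2^2) for a vector of non-negative losses."""
    return sum(losses), sum(v * v for v in losses)

small = l1_and_l2sq([0.5, 0.5, 0.25])   # all losses below one: L2^2 < L1
large = l1_and_l2sq([2.0, 4.0])         # all losses above one: L2^2 > L1
print(small, large)
```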



4 Generalization Bounds 

In this section, we consider the case where the training sample processed 
is drawn according to some distribution D. Under some mild conditions 
on the loss function, the hypotheses returned by an on-line learning algo- 
rithm can then be combined to define a hypothesis whose generalization 
error can be bounded in terms of its regret. Such a hypothesis can be
determined via cross-validation [Littlestone, 1989] or using the
online-to-batch theorem of Cesa-Bianchi et al. [2004]. The latter can be
combined
with any of the mistake bounds presented in the previous section to 
derive generalization bounds for the Perceptron predictor. 
Given $\delta > 0$, a sequence of labeled examples
$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_T, y_T)$, a sequence of
hypotheses $h_1, \ldots, h_T$, and a loss function $L$, define the
penalized risk minimizing hypothesis as $\widehat{h} = h_{i^*}$ with

\[
i^* = \operatorname*{argmin}_{i \in [1, T]}
  \frac{1}{T - i + 1} \sum_{t=i}^{T} L(y_t h_i(\mathbf{x}_t))
  + \sqrt{\frac{\log \frac{T(T+1)}{\delta}}{2(T - i + 1)}}.
\]
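
The selection rule above can be sketched as follows. This is an
illustrative reading of the definition, not the authors' code: hypotheses
are represented as Python callables, the function name
`penalized_risk_minimizer` is hypothetical, and the loss is assumed to be
bounded by one and applied to the margin $y_t h_i(\mathbf{x}_t)$.

```python
import math

def penalized_risk_minimizer(hyps, xs, ys, loss, delta):
    """Index i* (1-based, as in the text) of the penalized risk minimizing hypothesis.

    hyps[i - 1] plays the role of h_i. It is evaluated on examples t = i, ..., T
    and penalized by sqrt(log(T (T + 1) / delta) / (2 (T - i + 1))).
    """
    T = len(xs)
    best_i, best_val = None, float("inf")
    for i in range(1, T + 1):
        n = T - i + 1                                   # number of evaluation points
        risk = sum(loss(ys[t] * hyps[i - 1](xs[t])) for t in range(i - 1, T)) / n
        val = risk + math.sqrt(math.log(T * (T + 1) / delta) / (2 * n))
        if val < best_val:
            best_i, best_val = i, val
    return best_i

# Toy usage with the zero-one loss on scalar inputs (all values hypothetical).
zero_one = lambda margin: 1.0 if margin <= 0 else 0.0
hyps = [lambda x: -x, lambda x: x, lambda x: x]        # h_1 is always wrong here
i_star = penalized_risk_minimizer(hyps, [1.0, 2.0, -1.0], [1, 1, -1], zero_one, 0.1)
```

The penalty term grows as fewer points remain for evaluation, so among
equally accurate hypotheses the rule favors the one evaluated on the
largest number of points.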

The following theorem gives a bound on the expected loss of h on future 
examples. 



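Theorem 4 (Cesa-Bianchi et al. [2004]). Let $S$ be a labeled sample
$((\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_T, y_T))$ drawn i.i.d. according
to $D$, $L$ a loss function bounded by one, and $h_1, \ldots, h_T$ the
sequence of hypotheses generated by an on-line algorithm $A$ sequentially
processing $S$. Then, for any $\delta > 0$, with probability at least
$1 - \delta$, the following holds:

\[
\mathop{\mathrm{E}}_{(\mathbf{x}, y) \sim D} \big[ L(y \widehat{h}(\mathbf{x})) \big]
  \le \frac{1}{T} \sum_{i=1}^{T} L(y_i h_i(\mathbf{x}_i))
  + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.
\tag{9}
\]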



Note that this theorem does not require the loss function to be convex. 
Thus, if L is the zero-one loss, then the empirical loss term is precisely 
the average number of mistakes made by the algorithm. Plugging in any 
of the mistake bounds from the previous sections then gives us a learn- 
ing guarantee with respect to the performance of the best hypothesis 
as measured by a margin-loss (or any $\gamma$-admissible loss if using
Theorem 2). Let $\widehat{\mathbf{w}}$ denote the weight vector
corresponding to the penalized risk minimizing Perceptron hypothesis
chosen from all the intermediate hypotheses generated by the algorithm.
Then, in view of Theorem 2, the following corollary holds.



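Corollary 3. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$. For any
vector $\mathbf{u} \in \mathbb{R}^N$ with $\|\mathbf{u}\| \le 1$ and any
$\gamma$-admissible loss function $\phi_\gamma$, consider the vector of
losses incurred by $\mathbf{u}$:
$\mathbf{L}_{\phi_\gamma}(\mathbf{u}) = [\phi_\gamma(y_t(\mathbf{u} \cdot \mathbf{x}_t))]_{t \in I}$.
Then, for any $\delta > 0$, with probability at least $1 - \delta$, the
following generalization bound holds for the penalized risk minimizing
Perceptron hypothesis $\widehat{\mathbf{w}}$:

\[
\Pr_{(\mathbf{x}, y) \sim D} [y(\widehat{\mathbf{w}} \cdot \mathbf{x}) < 0]
  \le \inf_{\gamma > 0, \|\mathbf{u}\| \le 1}
  \frac{\|\mathbf{L}_{\phi_\gamma}(\mathbf{u})\|_1}{\phi_\gamma(0) T}
  + \frac{\gamma \sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}}{\phi_\gamma(0) T}
  + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.
\]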


Any $\gamma$-admissible loss can be used to derive a more explicit form of
this bound in special cases, in particular the hinge loss or the squared
hinge loss. Using Theorem 3, we obtain the following $L_2$-norm
generalization bound.



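Corollary 4. Let $I$ denote the set of rounds at which the Perceptron
algorithm makes an update when processing a sequence of training
instances $\mathbf{x}_1, \ldots, \mathbf{x}_T \in \mathbb{R}^N$. For any
$\rho > 0$ and any $\mathbf{u} \in \mathbb{R}^N$ with $\|\mathbf{u}\| \le 1$,
consider the vector of $\rho$-hinge losses incurred by $\mathbf{u}$:
$\mathbf{L}_\rho(\mathbf{u}) = \big[ \big(1 - \frac{y_t(\mathbf{u} \cdot \mathbf{x}_t)}{\rho}\big)_+ \big]_{t \in I}$.
Then, for any $\delta > 0$, with probability at least $1 - \delta$, the
following generalization bound holds for the penalized risk minimizing
Perceptron hypothesis $\widehat{\mathbf{w}}$:

\[
\Pr_{(\mathbf{x}, y) \sim D} [y(\widehat{\mathbf{w}} \cdot \mathbf{x}) < 0]
  \le \inf_{\rho > 0, \|\mathbf{u}\| \le 1}
  \frac{1}{T} \left( \frac{\|\mathbf{L}_\rho(\mathbf{u})\|_2}{2}
  + \sqrt{\frac{\|\mathbf{L}_\rho(\mathbf{u})\|_2^2}{4}
  + \frac{\sqrt{\sum_{t \in I} \|\mathbf{x}_t\|^2}}{\rho}} \right)^2
  + 6 \sqrt{\frac{1}{T} \log \frac{2(T+1)}{\delta}}.
\]



KernelPerceptron($\boldsymbol{\alpha}_0$)

    $\boldsymbol{\alpha} \leftarrow \boldsymbol{\alpha}_0$    ▷ typically $\boldsymbol{\alpha}_0 = 0$
    for $t \leftarrow 1$ to $T$ do
        Receive($\mathbf{x}_t$)
        $\widehat{y}_t \leftarrow \mathrm{sgn}\big(\sum_{s=1}^{T} \alpha_s y_s K(\mathbf{x}_s, \mathbf{x}_t)\big)$
        Receive($y_t$)
        if ($\widehat{y}_t \neq y_t$) then
            $\alpha_t \leftarrow \alpha_t + 1$
    return $\boldsymbol{\alpha}$



Fig. 2. Kernel Perceptron algorithm for PDS kernel $K$.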



5 Kernel Perceptron algorithm 

The Perceptron algorithm of Figure 1 can be straightforwardly extended
to define a non-linear separator using a positive definite kernel $K$
[Aizerman et al., 1964]. Figure 2 gives the pseudocode of that algorithm,
known as the kernel Perceptron algorithm. The classifier $\mathrm{sgn}(h)$
learned by the algorithm is defined by
$h \colon \mathbf{x} \mapsto \sum_{t=1}^{T} \alpha_t y_t K(\mathbf{x}_t, \mathbf{x})$.
The results of the previous sections apply similarly to the kernel
Perceptron algorithm with $\|\mathbf{x}_t\|^2$ replaced with
$K(\mathbf{x}_t, \mathbf{x}_t)$. In particular, the quantity
$\sum_{t \in I} \|\mathbf{x}_t\|^2$ appearing in several of the learning
guarantees can be replaced with the familiar trace $\mathrm{Tr}[\mathbf{K}]$
of the kernel matrix $\mathbf{K} = [K(\mathbf{x}_i, \mathbf{x}_j)]_{i, j \in I}$
over the set of points at which an update is made, which is a standard
term appearing in margin bounds for kernel-based hypothesis sets.
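
A minimal Python sketch of the kernel Perceptron and of the trace term
follows. It assumes a single pass, a zero initial coefficient vector, and
the convention sgn(0) = +1; the function names and the toy data are
illustrative. With the linear kernel used here, the trace over the update
rounds coincides with the sum of squared norms of the updated points.

```python
def kernel_perceptron(xs, ys, kernel):
    """Single pass of the kernel Perceptron with alpha_0 = 0.

    alpha[t] counts the updates made at round t (0 or 1 in a single pass)."""
    T = len(xs)
    alpha = [0] * T
    for t in range(T):
        score = sum(alpha[s] * ys[s] * kernel(xs[s], xs[t]) for s in range(T))
        y_hat = 1 if score >= 0 else -1     # assumption: sgn(0) = +1
        if y_hat != ys[t]:
            alpha[t] += 1
    return alpha

def linear_kernel(a, b):
    """K(a, b) = a . b, for which the kernel Perceptron matches Figure 1."""
    return sum(p * q for p, q in zip(a, b))

xs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
ys = [1, -1, -1]
alpha = kernel_perceptron(xs, ys, linear_kernel)

# Trace of the kernel matrix restricted to the update rounds I.
trace = sum(linear_kernel(xs[t], xs[t]) for t in range(len(xs)) if alpha[t] > 0)
```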


